Abstract

Efficient utilization of underutilized spectrum is the main theme of the current research. A cognitive radio, with the help of the Q-learning algorithm, is used to detect the presence of primary signals and to utilize the spectrum in their absence. The proposed Q-learning model identifies previously known signals and learns to detect signals that otherwise could not be detected, which helps the spectrum to be utilized efficiently. The simulation results are further confirmed with results obtained through the MATLAB gatool.

Key words - cognitive radio, channels, reinforcement 
learning, primary user, genetic algorithm 

1. Introduction 

Cognitive radio (CR) is an emerging wireless technology for dynamic spectrum utilization and spectrum reconfiguration [1, 2, 3, 4]. CR supports efficient spectrum utilization, interference avoidance, better system performance, and spectrum sensing followed by adaptation. A CR can serve as a reconfigurable communication device. It may be installed at the base station to detect the presence of other users in the environment, and it can alter the frequency, power, modulation, coding, and other transmission parameters in real time. Real-time spectrum detection by a CR is based on passive sensing and on geo-location with a spectrum usage database [15, 16, 17, 18, 19].

Presently, cellular systems are based on fixed channel allocation [7], also called static allocation. In fixed channel allocation, the set of channels is partitioned and the partitions are assigned to cells. The same frequency may be reused in different cells as long as the cells do not interfere with one another. When a call arrives and the pre-assigned channel is unused, the channel is assigned to the call; otherwise the call is blocked. The static policy (fixed channel allocation) therefore cannot follow the real demand, and most of the spectrum remains unused. The alternative and more efficient policy is dynamic allocation of spectrum [5, 6, 7, 8, 9]. To make efficient use of the spectrum, we divide the customers into two groups: high-priority or primary users (PU) and low-priority or secondary users (SU). A PU always gets the spectrum whenever it wants to use it. In the absence of the PU, an SU may be allowed to use the spectrum, so that the spectrum is used efficiently. That is, if a high-priority user enters the domain, the low-priority user must vacate. This arrangement helps to use the spectrum efficiently, but many problems remain to be solved. One of them is detecting the PU, so that the SU can be moved out. To detect the presence of a PU we introduce an agent called cognitive radio (CR), which is situated at the base station and helps the SU to get access to the spectrum.
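The priority rule described above (the PU always gets the channel, and an SU may use it only while the PU is absent) can be illustrated with a minimal Python sketch. The `Channel` class and its method names are hypothetical and introduced only for illustration; they are not part of the proposed system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Channel:
    """One licensed channel shared by a primary and a secondary user."""
    primary_active: bool = False
    secondary: Optional[str] = None  # id of the SU currently on the channel

    def primary_arrives(self) -> Optional[str]:
        """The PU takes the channel; any SU on it must vacate at once."""
        self.primary_active = True
        evicted, self.secondary = self.secondary, None
        return evicted

    def primary_leaves(self) -> None:
        """The PU releases the channel, which becomes idle again."""
        self.primary_active = False

    def secondary_request(self, su_id: str) -> bool:
        """Grant the channel to an SU only if it is completely idle."""
        if self.primary_active or self.secondary is not None:
            return False
        self.secondary = su_id
        return True


ch = Channel()
print(ch.secondary_request("su-1"))  # True: the channel is idle
print(ch.primary_arrives())          # su-1: the SU is moved out
print(ch.secondary_request("su-2"))  # False: the PU holds the channel
```

The sketch assumes that the presence of the PU is already known; obtaining that knowledge reliably is exactly the detection problem addressed in the rest of the paper.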

The CR not only detects the presence of a PU but also handles day-to-day requirements, including the selection of band, mode, and format, and the user's communication requirements. CRs are installed at the SU level to provide better service to SUs without interfering with the PU. Four points are outlined for the protection of PUs:

• Create a no-talk zone between PUs and SUs, in which SUs remain silent. The size of the no-talk zone depends on the CR’s maximum transmit power. If a CR cannot tell where the primary system’s receivers are located, SUs must stay out of the area that is the union of all possible no-talk zones.

• Sometimes the no-talk zone needs to be extended further to avoid interference with PUs. The border of the no-talk zone must be placed outside the decode radius.

• In the absence of PUs, the SUs must respect the 
protected area. 

• New challenges and new opportunities arise in the 
presence of multiple users. 

CR technology at the SU level can efficiently utilize the spectrum that is unused at any given time and geographical location [12, 13, 14]. The basic idea is that a CR terminal senses its environment and location, adapts its features (power, frequency, modulation, etc.), and dynamically reuses the available spectrum. A reinforcement learning algorithm is used because it learns without training examples. It senses the signals provided as part of its input as well as the signal of a new user entering the domain that it has not seen before. The basic principle of reinforcement learning is learning through experience: the agent interacts with a bidding environment and receives a reward for winning and a penalty for losing [10, 11].

Two main problems arise in recognizing a PU entering the domain:

• failure to detect the presence of a PU

• false detection of a PU when no such user exists (has entered the domain).
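These two error types are commonly quantified as a missed-detection rate and a false-alarm rate. The following Python sketch shows how both could be computed from a sensing log; the function name and the sample log are hypothetical and do not come from the experiments reported here.

```python
def detection_error_rates(truth, decisions):
    """Return (missed_detection_rate, false_alarm_rate).

    truth[i]     -- True if a PU was really present in sensing slot i
    decisions[i] -- True if the detector reported a PU in slot i
    """
    present = [d for t, d in zip(truth, decisions) if t]
    absent = [d for t, d in zip(truth, decisions) if not t]
    missed = sum(1 for d in present if not d) / len(present) if present else 0.0
    false_alarm = sum(1 for d in absent if d) / len(absent) if absent else 0.0
    return missed, false_alarm


# Hypothetical sensing log: 4 slots with a PU, 4 slots without.
truth = [True, True, True, True, False, False, False, False]
decisions = [True, True, True, False, False, True, False, False]
print(detection_error_rates(truth, decisions))  # (0.25, 0.25)
```

A missed detection causes the SU to interfere with the PU, whereas a false alarm only wastes a transmission opportunity, so the two rates are usually not weighted equally.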

To avoid failing to detect the PU signal or falsely detecting it, we draw on existing literature that solves similar problems. One solution closely related to the current problem is the Q-learning model in reinforcement learning. The Q-learning algorithm addresses the problem of detecting an unknown signal as it enters the network. Once a signal is detected, it is stored in the database, so that a similar signal will be detected the next time it occurs.

2. Related Work 

The detection of weak signals from primary transmitters was studied by Hoven [17], who showed that signal detection is very difficult if there is uncertainty in the receiver noise variance. The authors suggested that detection can be improved through the use of a pilot tone from the primary transmitter. Sahai and Hoven [18] discussed that if the location of the primary receivers is unknown, the cognitive radio must rely on the weak primary signal to make decisions. Wild et al. [12] took advantage of the local oscillator (LO) leakage power emitted by the radio frequency (RF) front end to locate the primary receivers and guarantee that the CR will not interfere with them. Wild et al. also proposed a low-cost, sensor-based cognitive radio system architecture in which sensors close to the primary receivers detect the LO leakage. Haartsen [20] argued that CR techniques are not recommended for deployment in bands where interference-sensitive or licensed QoS (quality of service) operations are generally assumed, since the signal levels are expected to be low. Haartsen also suggested that, without prior knowledge of the signature of the existing service signal, it is very hard for a CR to detect such signals. A new methodology is therefore required to identify such signals.

The identification of the primary user is based on the signal characteristics. Cabric [2] showed that the detection of weak signals can be improved significantly through the cooperation of CRs. Cabric discussed three techniques for identifying the primary signal through the received signal strength, namely the matched filter, the energy detector, and cyclostationary feature detection. With a matched filter, the CR needs prior knowledge of the primary signal at both the MAC and PHY layers. The disadvantage is that the CR requires a dedicated receiver for every class of primary user. The matched filter can be simplified through energy detection, which uses the Fast Fourier Transform. Energy detectors have three disadvantages. First, they are highly susceptible to unknown or changing noise levels. Second, they do not differentiate between modulated signals, noise, and interference. Third, they do not work for spread spectrum signals. Cyclostationary feature detection can be used to detect a random signal with a particular modulation type against a background of noise and other modulated signals. Cyclostationary signals exhibit correlation between widely separated spectral components, due to the spectral redundancy caused by periodicity. In addition to these techniques, cooperative spectrum sensing may be useful if the neighboring cognitive radios are programmed to provide the needed information (taking the resulting overheads into account).
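The energy detector and its sensitivity to the noise level can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the energy is computed in the time domain (which, by Parseval's theorem, equals the energy summed over the FFT bins), and the signal amplitude, noise power, and threshold are hypothetical values chosen only for illustration.

```python
import math
import random


def energy_detect(samples, threshold):
    """Return True if the average sample energy exceeds the threshold."""
    energy = sum(x * x for x in samples) / len(samples)
    return energy > threshold


rng = random.Random(0)            # fixed seed so the run is repeatable
n = 1000
noise_only = [rng.gauss(0.0, 1.0) for _ in range(n)]            # noise power ~ 1
tone = [3.0 * math.sin(2 * math.pi * 0.05 * k) for k in range(n)]  # power ~ 4.5
occupied = [s + rng.gauss(0.0, 1.0) for s in tone]
louder_idle = [rng.gauss(0.0, 2.0) for _ in range(n)]           # noise power ~ 4

threshold = 2.0                   # assumed: twice the nominal noise power
print(energy_detect(noise_only, threshold))   # False: idle channel
print(energy_detect(occupied, threshold))     # True: strong signal present
print(energy_detect(louder_idle, threshold))  # True: false alarm, noise level changed
```

The last case shows the first disadvantage listed above: the threshold was set for a noise power of about 1, so an idle channel with a higher noise level is wrongly reported as occupied.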

The primary user signal is detected by its signal characteristics, or signature [12, 13]. Current artificial intelligence (AI) techniques using rule-based systems, neural networks, and stochastic models help to detect signals with a known signature, but they may have difficulty detecting signals that deviate from the known signature. Most wireless signals have either static signatures (previously known) or dynamic signatures (deviating from the known signature). Hard-to-detect signatures often require cryptographic models or purpose-built experimental models to detect them.

To communicate successfully, the radio must first be configured to fit the specific channel; second, it must support the service types required by the user, such as voice or data; and third, above everything else, it must obey the regulatory requirements in order to operate legally in any band and geographic location. To provide the best performance, the radio needs a cognitive engine (CE) that analyzes the physical link, user demands, and regulatory regimes and balances multiple objectives and constraints. Maldonado [21] showed that genetic algorithms are well suited to multi-objective optimization and decision problems (MOGA, multi-objective GA); a preliminary experiment showed an improvement of 20 dB in the SINR (signal to interference plus noise ratio) in the ISM band using the CR technique. Rondeau [22] used a GA approach for a wireless system and showed that a set of radio parameters encoded as genes can be optimized for the user’s current needs.

3. Reinforcement Learning and Q-Algorithm 

Problems whose solutions optimize an objective function defined over multiple steps require considerable prior knowledge. Dynamic programming techniques can solve such multistage sequential decision problems when the state space, including the transition probabilities, is completely known. In most real problems, however, the state transition probabilities are not known. Reinforcement learning (RL) deals with this lack of knowledge by using each observed sequence of state, action, and resulting state to incrementally learn the correct value function. This requires a large number of samples and a long learning process, but it handles problems that could not be solved by dynamic programming methods.

Reinforcement learning (RL) is a way of providing decision-making agents with optimal policies by assigning rewards and punishments to their actions, based on the temporal feedback obtained during the agent's active interaction with the system environment. In the RL model depicted in Figure 1, a learning agent selects an action for the system, which leads the system along a unique path until another decision-making state is encountered. At that point, the system consults the learning agent for the next action. During a state transition, the agent gathers information about the new state, the immediate reward, and the time spent during the transition; based on this information, the agent updates its knowledge base using an algorithm and selects the next action. The process is repeated, and the learning agent continues to improve its performance. These steps together form the Q-learning algorithm, a reinforcement learning algorithm, whose details are given in Algorithm 1.



Figure 1. Find a control policy that will maximize the 
observed rewards over the lifetime of the agent 

Algorithm 1

Let s denote the state, a the corresponding action, and r the reward.
Repeat the following steps until there is no negative power gain:

1. Select an action a at state s and execute it.

2. Receive the immediate reward r.

3. Observe the new state s' and action a'.

4. Update the table entry (of rewards and states) for Q(s, a) as:

   $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$
   $s \leftarrow s'$

   where $\gamma$ is the learning rate ($0 \le \gamma < 1$); in the Q-learning literature this factor is usually called the discount factor.

Update the Q-table with the newly observed signal.
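The update rule of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under assumptions that are not stated in the paper: the channels form a grid, an action moves to a neighboring cell, the reward is the value of the cell that is entered, actions are chosen at random, and the loop runs for a fixed number of steps instead of the stopping rule of Algorithm 1. It is not the MATLAB implementation used for the reported results.

```python
import random

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action, rows, cols):
    """Move on the grid; a move that would leave the grid keeps the state."""
    r, c = state[0] + action[0], state[1] + action[1]
    return (r, c) if 0 <= r < rows and 0 <= c < cols else state


def q_learning(grid, gamma, n_steps=20000, seed=0):
    """Apply Q(s,a) <- r + gamma * max_a' Q(s',a') along a random walk."""
    rows, cols = len(grid), len(grid[0])
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(rows) for c in range(cols) for a in ACTIONS}
    s = (0, 0)
    for _ in range(n_steps):
        a = rng.choice(ACTIONS)                 # 1. select and execute an action
        s_next = step(s, a, rows, cols)
        reward = grid[s_next[0]][s_next[1]]     # 2. immediate reward (assumed)
        best_next = max(q[(s_next, b)] for b in ACTIONS)  # 3. observe s'
        q[(s, a)] = reward + gamma * best_next  # 4. update the table entry
        s = s_next
    return q


def state_values(q, rows, cols):
    """Best Q-value per state, laid out like the input grid."""
    return [[max(q[((r, c), a)] for a in ACTIONS) for c in range(cols)]
            for r in range(rows)]


grid = [[-2, 2], [-10, 5]]        # four channel values arranged as 2 x 2
q = q_learning(grid, gamma=0.8)
for row in state_values(q, 2, 2):
    print([round(v, 2) for v in row])
```

Under these assumed dynamics the state values settle at about 22, 25, 25, and 25. They are not expected to match the values reported in the simulations, because the reward definition and any scaling used there are not specified.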

4. Simulations 

Algorithm 1 was implemented in MATLAB (version 2006b). The basic idea of RL is to store a Q-factor for each state-action pair in the system. Thus, Q(i, a) denotes the Q-factor for state i and action a. The Q-factors are initialized to arbitrary numbers. The system is then simulated using the algorithm: an action is selected in each visited state, and an immediate reward or penalty is awarded. The reward or penalty of the transition is recorded as feedback and is used to update the Q-factor of the action selected in the previous state. The process continues for a large number of transitions. At the end of this phase, called the learning phase, the action whose Q-factor has the highest value is declared the optimal action for that state.
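The final step, choosing for each state the action with the highest Q-factor, can be written as a short Python sketch. The state and action names and the Q-factors below are hypothetical.

```python
def greedy_policy(q_table):
    """Map each state to the action with the largest Q-factor.

    q_table is a dict keyed by (state, action) pairs.
    """
    best = {}
    for (state, action), value in q_table.items():
        if state not in best or value > best[state][1]:
            best[state] = (action, value)
    return {state: action for state, (action, _) in best.items()}


# Hypothetical Q-factors for two states and two actions.
q_table = {
    ("s0", "stay"): 1.5, ("s0", "switch"): 4.0,
    ("s1", "stay"): 7.0, ("s1", "switch"): 2.5,
}
print(greedy_policy(q_table))  # {'s0': 'switch', 's1': 'stay'}
```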

Most methods for approximating the value function in RL are naturally represented as matrices. In the current simulation environment, we arrange the spectrum (channels) as a square matrix, where each element of the matrix denotes a channel. A negative value of an element means that there is no activity, that is, no signal. The matrix is made square by introducing dummy states: if the number of channels is 99, for example, channel 100 is added as a dummy channel with a negative value, which gives a square matrix with 10 rows and 10 columns. The minimum value or strength of a signal starts from 0 (-0 is also treated as 0). Two cases are considered for the input value 0. In the first case, the value 0 is taken as an indication of the minimum possible signal (see the output for 64 channels in Table 1A). In the second case, the value 0 is taken to mean no signal; the results are provided in Table 1B. The Q-learning algorithm was run with 4 channels (two rows and two columns) up to 900 channels (30 rows and 30 columns). The learning rate $\gamma$ was varied from 0.2 to 0.8. Table 1A and Table 1B are shown for $\gamma$ = 0.2, where the results are reasonably acceptable. The first case was initially tested with four signals (two rows and two columns). For example, if the input to the Q-algorithm (Algorithm 1) is -2, 2, -10, 5 with learning rate $\gamma$ = 0.8, the output of the algorithm (the optimum values generated in the Q-table) is:

0, 88, 0, 100

That is, the optimum value of each state is calculated and stored in the Q-table (database). Since the stored value is the optimum value for that state, signal detection depends entirely on it: the value is now available to the system (CR) as a known signal value. The newly calculated signal values are combined with the previously known signals stored in the Q-table (Q-database), and the process continues in order to detect whether any new signal enters the domain of the primary signal area. Hence, using the reinforcement learning algorithm, incoming signals are processed, stored in the Q-table, and merged with the previously known signals, which are then used to identify incoming PU signals appropriately.
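The construction of the square channel matrix with dummy channels, described earlier in this section, can be sketched in Python as follows. The function name and the dummy value of -1 are assumptions; the paper only states that the dummy channel has a negative value.

```python
import math


def to_square_matrix(channels, dummy=-1):
    """Arrange channel values row by row in the smallest square matrix.

    Unused cells are filled with `dummy`, a negative value meaning
    "no signal" (the exact dummy value is an assumption).
    """
    n = math.isqrt(len(channels))
    if n * n < len(channels):
        n += 1
    padded = list(channels) + [dummy] * (n * n - len(channels))
    return [padded[i * n:(i + 1) * n] for i in range(n)]


matrix = to_square_matrix(list(range(1, 100)))   # 99 channels
print(len(matrix), len(matrix[0]))               # 10 10
print(matrix[-1][-1])                            # -1 (the dummy channel 100)
```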
In the second example, we consider 64 channels as input to Algorithm 1 (Table 1). The input consists of negative values (no signal strength) and positive values (with signal strength), and the corresponding output is provided in Table 1A. Alternatively, the signal value 0 is taken as no signal, and the output is shown in Table 1B.


Table 1: Input to the Algorithm 1 


   -3    -9     2  -100     0  -300     0    -1
  -20   -50   -60     7    -4     7    10    -6
  -70   -20    -1   100   -40    50     8    -9
   -3     7     3    -9     0   -50    10   -30
    9   -30   -22    70   -55    10     7   -20
  -44    40   -55   -88    60    10    60   -43
  -12   -32    33     7   -77    50    10   -66
   50     0     0    10    78   -22    80    55


After execution of the Q-algorithm (Algorithm 1), the state of the Q-table is as shown in Table 1A and Table 1B:


Table 1A: Output of the Algorithm 1 (input value 0 treated as minimum signal)


    0     0    21     0    14     0    12     0
    0     0     0    11     0    20    21     0
    0     0     0   100     0    62    20     0
    0    11    22     0    14     0    21     0
   12     0     0    71     0    23    19     0
    0    42     0     0    71    23    69     0
    0     0    51    11     0    62    21     0
   52     4    20    14    88     0    88    70


Table 1B: Output of the Algorithm 1 (input value 0 treated as no signal)


    0     0    21     0     0     0     0     0
    0     0     0    11     0    20    21     0
    0     0     0   100     0    62    20     0
    0    11    22     0     0     0    21     0
   12     0     0    71     0    23    19     0
    0    42     0     0    71    23    69     0
    0     0    51    11     0    62    21     0
   52     0     0    14    88     0    88    70


The values in Table 1A and Table 1B show that the state values are updated to their optimum values (the signal strength is improved to the level of detection), which further helps signal detection. The important point is the difference between the algorithm output in Table 1A and in Table 1B: in Table 1A the input value 0 is considered to indicate some signal strength, whereas in Table 1B the input value 0 is considered to indicate no signal.

To understand the tables, compare the corresponding values of Table 1 and Table 1A. The negative input values become zeros in Table 1A, and the positive values are raised to a higher level, so that the signal can be represented. If a signal is represented clearly, it will be identified without failure.
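The relationship between the tables can be checked mechanically. The Python sketch below copies the 64 values of Table 1, Table 1A, and Table 1B in their listed order and verifies the properties described above; the variable names are introduced only for this check.

```python
TABLE_1 = [
    -3, -9, 2, -100, 0, -300, 0, -1,
    -20, -50, -60, 7, -4, 7, 10, -6,
    -70, -20, -1, 100, -40, 50, 8, -9,
    -3, 7, 3, -9, 0, -50, 10, -30,
    9, -30, -22, 70, -55, 10, 7, -20,
    -44, 40, -55, -88, 60, 10, 60, -43,
    -12, -32, 33, 7, -77, 50, 10, -66,
    50, 0, 0, 10, 78, -22, 80, 55,
]
TABLE_1A = [
    0, 0, 21, 0, 14, 0, 12, 0,
    0, 0, 0, 11, 0, 20, 21, 0,
    0, 0, 0, 100, 0, 62, 20, 0,
    0, 11, 22, 0, 14, 0, 21, 0,
    12, 0, 0, 71, 0, 23, 19, 0,
    0, 42, 0, 0, 71, 23, 69, 0,
    0, 0, 51, 11, 0, 62, 21, 0,
    52, 4, 20, 14, 88, 0, 88, 70,
]
TABLE_1B = [
    0, 0, 21, 0, 0, 0, 0, 0,
    0, 0, 0, 11, 0, 20, 21, 0,
    0, 0, 0, 100, 0, 62, 20, 0,
    0, 11, 22, 0, 0, 0, 21, 0,
    12, 0, 0, 71, 0, 23, 19, 0,
    0, 42, 0, 0, 71, 23, 69, 0,
    0, 0, 51, 11, 0, 62, 21, 0,
    52, 0, 0, 14, 88, 0, 88, 70,
]

# Every negative input (no signal) maps to 0 in both outputs.
negatives_zeroed = all(a == 0 and b == 0
                       for x, a, b in zip(TABLE_1, TABLE_1A, TABLE_1B) if x < 0)
# No positive input is lowered by the algorithm.
positives_kept = all(a >= x for x, a in zip(TABLE_1, TABLE_1A) if x > 0)
# The two outputs differ only where the input value is 0.
differing = [i for i, (a, b) in enumerate(zip(TABLE_1A, TABLE_1B)) if a != b]
zero_inputs = [i for i, x in enumerate(TABLE_1) if x == 0]

print(negatives_zeroed, positives_kept)   # True True
print(differing == zero_inputs)           # True
```

The five positions at which Table 1A and Table 1B differ are exactly the five channels whose input value is 0, which is the only distinction between the two cases.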

The results in the tables therefore show that, using Algorithm 1, it is possible to detect all possible signals for the CR with the Q-algorithm of the RL model. The current research helps toward successful detection of the presence of a PU. The characteristics of the signals and a visual demonstration will be considered in future research.

The work on detecting primary receivers using the CR technique is in its initial stage, and very few references are available in the literature for comparison. One reference is Wild’s [12] proposal to use local oscillator (LO) leakage to detect the primary signal. Wild took advantage of the LO leakage power emitted by the RF front end, which allows a CR to locate the primary receivers, and showed that the approach is useful in detecting a PU. In the current research we used RL to detect the presence of PU signals; because this detection process differs from Wild’s experimental process, the results cannot be compared directly with Wild’s results. In the current research, Algorithm 1 supports unsupervised learning to detect the minimum-strength signal and stores it in the Q-table for future reference. The algorithm also helps to detect the absence of a signal, which is represented by a negative input value in Table 1 and results in an output of 0 in both Table 1A and Table 1B.

The results were further verified using the MATLAB gatool. The fitness converges immediately (top left part of Figure 2), and the distance between any two individuals converges after 150 generations, or close to 200 generations (top right part of Figure 2). The main parameters used to initialize the gatool are given below:

Population Type: Double Vector 
Population size: 30 
Creation function: Uniform 
Selection Function: Stochastic Uniform 
Reproduction: 

• Elite count: 2 

• Crossover fraction: 0.8 
Mutation function: Gaussian 
Crossover function: Scattered 
Migration direction: forward 
Algorithm Settings: 

• Initial Penalty: 10 

• Penalty factor: 100 
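To show how the settings listed above interact, the following Python sketch runs a small genetic algorithm with a population of 30, an elite count of 2, a crossover fraction of 0.8, scattered crossover, Gaussian mutation, and a simplified stochastic uniform selection. It is not the gatool and does not reproduce Figure 2: the objective function is a placeholder, the rank-based selection shares, the mutation scale, and the number of variables are assumptions, and the migration and penalty settings are not modeled.

```python
import random

POP_SIZE, ELITE, XOVER_FRAC, N_VARS = 30, 2, 0.8, 2


def fitness(x):
    """Placeholder objective to minimize (NOT the paper's fitness)."""
    return sum(v * v for v in x)


def stochastic_uniform(pop, scores, n, rng):
    """Pick n parents with equally spaced pointers over rank-based shares."""
    order = sorted(range(len(pop)), key=lambda i: scores[i])
    shares = [1.0 / (rank + 1) ** 0.5 for rank in range(len(pop))]  # assumed
    step = sum(shares) / n
    pointer, cum, idx, parents = rng.uniform(0, step), shares[0], 0, []
    for _ in range(n):
        while pointer > cum and idx < len(shares) - 1:
            idx += 1
            cum += shares[idx]
        parents.append(pop[order[idx]])
        pointer += step
    rng.shuffle(parents)
    return parents


def evolve(generations=200, seed=1):
    """Run the GA and return (final population, best score per generation)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-10, 10) for _ in range(N_VARS)]   # uniform creation
           for _ in range(POP_SIZE)]
    best_history = []
    n_xover = round(XOVER_FRAC * (POP_SIZE - ELITE))   # crossover children
    n_mut = POP_SIZE - ELITE - n_xover                 # mutation children
    for _ in range(generations):
        scores = [fitness(x) for x in pop]
        ranked = sorted(range(POP_SIZE), key=lambda i: scores[i])
        best_history.append(scores[ranked[0]])
        elite = [pop[i][:] for i in ranked[:ELITE]]    # survive unchanged
        parents = stochastic_uniform(pop, scores, 2 * n_xover + n_mut, rng)
        children = []
        for k in range(n_xover):                       # scattered crossover
            p1, p2 = parents[2 * k], parents[2 * k + 1]
            children.append([a if rng.random() < 0.5 else b
                             for a, b in zip(p1, p2)])
        for p in parents[2 * n_xover:]:                # Gaussian mutation
            children.append([v + rng.gauss(0.0, 1.0) for v in p])
        pop = elite + children
    return pop, best_history


population, history = evolve()
print(history[0], history[-1])   # best score in the first and last generation
```

Because the two elite individuals are copied unchanged into the next generation, the best score in this sketch can never get worse from one generation to the next.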

The results shown in Figure 2 are for 200 generations (tests were conducted with 100 to 300 generations), with stall generations set to 500 and a stall time limit of 2000. The immediate convergence (top left part of Figure 2) and the negligible difference between the best and mean fitness show that the results are acceptable. As usual, the genetic algorithm tool requires more execution time than conventional programming (in the MATLAB language, Java, or C), but it can produce near-optimal solutions to NP-complete problems.

The first part (top left) of the figure displays the fitness function, that is, the objective function that we want to minimize. The best fitness plot shows the best function value in each generation versus the generation number. The second part (top right) plots the average distance between individuals at each generation. The third part (bottom left) plots the expected number of children versus the raw scores at each generation. Finally, the fourth part (bottom right) plots a histogram of the scores at each generation.
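The quantity plotted in the second panel can be illustrated with a short Python sketch. It assumes that the average distance is the mean Euclidean distance over all pairs of individuals; the gatool may compute the plotted value differently, so this is only an illustration of the idea.

```python
import math
from itertools import combinations


def average_distance(population):
    """Mean Euclidean distance over all pairs of individuals."""
    pairs = list(combinations(population, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


print(average_distance([(0, 0), (3, 4), (6, 8)]))   # (5 + 10 + 5) / 3
print(average_distance([(1, 1), (1, 1), (1, 1)]))   # 0.0: fully converged
```

A value that shrinks toward zero over the generations indicates that the individuals are collapsing onto the same solution.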

It is observed from Figure 2 that the average distance between two individuals becomes very small after 150 generations (top right part of Figure 2). The gatool results support the RL output in Table 1A and Table 1B, since the fitness converges immediately, as shown in the first part of Figure 2, and the fitness of each individual is the same (see the fourth, bottom right, part of the figure).


[Figure 2 plot: Best fitness and Mean fitness versus Generation (Best: 104.7917, Mean: 104.7917); Average Distance Between Individuals; Fitness Scaling versus Raw scores.]

Figure 2. gatool output of Best Fitness, Distance 
between two Individuals, raw scores between two 
individuals, fitness of each individual. 

5. Conclusion

The proposed model, which uses the Q-learning algorithm, helps to detect primary signals. Appropriate detection of primary signals is very important for efficient utilization of the spectrum. The Q-learning algorithm learns from the data presented to the system (algorithm) by using the reward and penalty principle. The algorithm was implemented, and the results were presented in Table 1A and Table 1B. The results were further confirmed using the MATLAB gatool. The gatool results show that the system converges after 150 generations (Figure 2, top right). The bottom right part of Figure 2 further shows that the fitness of each individual is the same at the end of 200 generations, which means that the system has converged and the best possible chromosomes have been generated. That is the best signal detection possible at that point. It is concluded that the model helps to detect signals that are hard to detect.

6. References